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The analysis of the evolutionary dynamics of a population with many polymorphic loci is challeng- 
ing since a large number of possible genotypes needs to be tracked. In the absence of analytical 
solutions, forward computer simulations are an important tool in multi-locus population genetics. 
The run time of standard algorithms to simulate sexual populations increases as 8 L with the 
number L of loci, or with the square of the population size N. We have developed algorithms that 
allow to simulate large populations with a run-time that scales as 3 L . The algorithm is based on 
an analog of the Fast-Fourier Transform (FFT) and allows for arbitrary fitness functions (i.e. any 
cpistasis) and genetic maps. The algorithm is implemented as a collection of C-| — h classes and a 
Python interface. Availability: http://code.google.eom/p/ffpopsim/ 



Introduction 

Forward simulations of population genetics, track ei- 
ther the genotype of every individual in the population, 
or the number of individuals that carry a particular geno- 
type. The former has been implemented in a number of 
very flexible simulation packages ( Guillaume and Rouge 



mont| [2006] |Peng and Kimmel[ [2005 ; Spe ncer and Coop[ 



2004). In large populations with a moderate number of 



loci, storing the abundance of all possible 2 L genotypes 
is often faster. Simulating such large populations with a 
small number of loci is for example essential when study- 
ing the evolution of drug resistance in viral or bacterial 
pathogens (Weinreich et al. 2006). 



Individual-based population genetic simulations are 
quite straightforward and usually employ a discrete gen- 
eration scheme in which processes such as mutation, se- 
lection, and migration are applied at every generation 
to every individual. Individuals are then paired up via 
a mating scheme and recombinant offspring is produced. 
Existing toolboxes often emphasize biological realism and 



allow the user to specify comp lex life cycles, see e.g. |Guil- 
laume and Rougemont (2006). Our emphasis here is on 



efficient simulation of large populations. Instead of track- 
ing individuals, we keep track of the distribution P(g) 
of gametes across all possible 2 L genotypes, denoted by 
g = (si, . . . , si,) where Sj = 0/1. This genotype distri- 
bution changes due to mutation, selection and recombi- 
nation. The former two are again straightforward and 
require at most L 2 L operations (each genotype can mu- 
tate at any one of the L loci). In our implementation, 
selection acts on haploid gametes, precluding dominance 
effects. Recombination, however, is a computationally 
expensive operation since it involves pairs of parents (of 
which there are 4 L ) which can combine their genome in 
many different ways (2 L ). As a consequence, a naive im- 
plementation requires 8 L operations to calculate the dis- 
tribution of recombinant genotypes. It is intuitive that 
the complexity of this algorithm can be reduced: given 
a recombination pattern, only a fraction of the genome 



is passed on in sexual reproduction and all genotypes 
that agree on that fraction contribute identically. We 
will show below that exploiting this redundancy allows 
to reduce the number of operations from 8 L to 0(3 L ). 

After selection, mutation, and recombination, the pop- 
ulation distribution P(g) contains the expected number 
of individuals of genotype g in the next generation. For 
stochastic population genetics, we still need to resample 
the population in way that mimics the randomness of re- 
production. This is achieved by resampling individuals 
according to a Poisson distribution with mean NP(g) for 
each genotype, which will result in a population size of 
approximately N ± 0(y/N). 



Recombination via FFT 

The probability of producing a genotype g by recom- 
bination is 

%) = EE c «) F (» m )^ p ). (i) 

where £ := • • • , £l) specifies the particular way the 
parental genomes are combined: £i = (resp. 1) if locus 
i is derived from the mother (resp. father). The genotype 
g' is summed over; it represents the parts of the mater- 
nal (g m ) and paternal (g p ) genotypes that are not passed 
on to the offspring. We can decompose each parent into 
successful loci that made it into the offspring and wasted 
loci, as follows: g p = £,Ag + £Ag' and g m = £Ag + £Ag', 
where A and a bar over a variable indicate respectively 
the elementwise AND and NOT operators. The function 
C assigns a probability to each inheritance pattern, see 
Neher and Shraiman (2011 ) for a more detailed explana- 



tion. In a facultatively sexual population, a fraction r 
of P{g) is replaced by R(g), while r = 1 in an obligate 
sexual population. 

The central ingredient for the efficient recombina- 
tion is a fast-Fourier algorithm for functions on the L- 
dimensional binary hypercube. Every function on the 
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Figure 1 Calculating the Fourier transform in L 2 L opera- 
tions. Arrow going up indicate addition, going down sub- 
straction. For the general L dimensional hypercubes L cycles 
are necessary where terms differing at different bits are com- 
bined. 



hypercube can be expressed as 
F( ff ) = /<°>+5>/ i (1) + 



, titjf. 
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where t{ := (2sj — 1) and takes the values ±1. In total, 

j (k) 

there are 2 L coefficients ' iu for every subset of k loci 



out of a total of L loci (Weinberger 1991). Similarly, 

(k) 

each coefficient f> ' ■ is uniquely specified by 

/;':,. ' Y.i 1 ( 3 ) 

a 

These nominally 4 L operations can be done in L 2 L via 
the FFT scheme illustrated in Fig. [2j 

With some algebra (see online supplement), one can 
show that the generic Fourier coefficient of R(g) is given 
by 



r 
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where the sum runs over all partitions of i\ . . . ik into 
groups of I and k — I denoted by j :— (ji, ■ ■ . ,ji) and 
h := (hi, . . . , hk-i). Variables such as • are the 
Fourier coefficients of the genotype distribution, P(g), 
and the crossover function C is expanded into 
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The latter coefficients can be calculated efficiently by re- 
alizing that for k = L, there is exactly one term unequal 
to zero. All subsequent terms can be calculated by suc- 
cessive marginalization of unobserved loci. In total, cal- 
culating t requires 0(2 k ) operations. Since there 
are (^) terms of order k, the entire calculation requires 
0(3 L ) operations. In case of single crossover recombina- 
tion, the algorithm can be sped up further to 0(L2 L ). 



test routines are provided. As an example, we discuss 
here the problem of fitness valley crossing, which has re- 
ceived attention recently in the popula t ion ge netics liter- 
ature ( |Lynch[ |2010| |Weissman et al.\ |2010[ ). Consider 
a fitness landscape where the wild-type genotype has 
(malthusian) fitness s±, while the quadruple mutant has 
fitness si +S2- All intermediate genotypes have the same 
slightly deleterious fitness (— si relative to wild- type). 
The time required for crossing the valley can be com- 
puted by the following routine: 

import FFPopSim 

L = 4 # number of loci 

N = lelO # population size 

# create population and set rates 
c = FFPopSim. haploid_lowd(L) 

c . set_recombination_rates( [0 . 01] * (L-l)) 
c . set_mutation_rate (le-6) 

# start with wildtype: ObOOOO = 
c . set_genotypes ( [ObOOOO] , [N] ) 

# set positive relative fitness for wildtype 

# and quadruple mutant: Obllll = 15 
c.set_fitness_function( [ObOOOO, Obllll] , 

[si, sl+s2]) 

# evolve until the quadruple mutant spreads 
while c . get_genotype_f requency (Obllll) <0 . 5 : 

c. evolve (100) 
print c. generation 

The runtime and memory requirements of 3 L still pre- 
clude the simulation of more than L = 20 loci. For this 
reason, we also include a streamlined individual based 
simulation package with the same interface that can sim- 
ulate arbitrarily large number of loci and has an overall 
runtime and memory requirements 0(NL) in the worst 
case scenario. To speed up the simulation in many cases 
of interest, identical genotypes are grouped into clones. 
This part of the library was developed for whole genome 
simulations of large HIV populations (L = 10 4 , N > 10 5 ) 
and a specific wrapper class for HIV applications is pro- 
vided. As of now, the library does not support domi- 
nance effects which would require a fitness function that 
depends on pairs of haploid genomes. Such an extension 
to diploid populations is straightforward. 
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Appendix A: Prerequisites and compiling 

The toolbox makes extensive use of the GNU scientific library (http://www.gnu.org/softwcire/gsl/) and the 
Boost C++ library (http://www.boost.org/). The Python wrapper further requires NumPy, SciPy and MatplotLib 
(http://www.scipy.org). If all of these are installed and the appropriate path are set, FFPopSim can be compiled 
using Make. Installation instructions are provided in the INSTALL file. The building process creates files inside the 
folder pkg; CH — h headers are created in pkg/include, the static C++ library in pkg/lib, and the Python module 
files in pkg/python. 



Appendix B: Description of FFPopSim 

FFPopSim contains two packages of C++ classes and Python wrappers for forward population genetic simulations. 
One for large populations and relatively few loci (L < 20), another one for longer genomes. The former is called 
hapoid_lowd and tracks the abundance of all possible genotypes. The other one is called haploid_highd and tracks 
only genotypes present in the population. The latter only allows for limited complextity of fitness functions and 
crossover patterns. These two parts of the library have very similar syntax but work quite differently under the hood. 
We will therefore describe them separately below. 

A complete documentation in html is generated automatically from the source using Doxygen and can be found in 
doc/html/ index . html. 



1. FFPopSim for few loci 

a. Specification of the population and the fitness function 

Since we assume that each locus is in one of two possible states Si = 0/1, the genotype space is an L dimensional 
binary hypercube. The population is a distribution of individuals on that hypercube, and so are the mutant and 
recombinant genotypes. Also the fitness function F(g), which assigns a number to each genotype g — {si, . . . ,sl}, 
is a function on the hypercube. For this reason, hapoid_lowd makes extensive use of a class hypercube_lowd that 
stores an L dimensional hypercube and implements a number of operations on the hypercube, including a fast-Fourier 
transform (FFT). 

Every function on the hypercube can be expressed as 

^) = / (0) + E^ (1) +E^ ) + --- ( B1 ) 

i i<j 

where ti :— 2si — 1, i.e., simply a mapping from {0, 1} to {—1, +1}- There are (^) coefficients f^' i for every subset 
of k loci out of L loci, so in total 2 L coefficients (Weinberger 1991 1. A coefficient i is uniquely specified by 
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These nominally 4 L operations (2 L for each coefficient) can be done in L 2 L via the FFT scheme illustrated in Fig. [2] 
Both the forward and reverse transform are implemented in hypercube_lowd. 

An instance of hypercube_lowd can be initialized with the function values of the hypercube or with its Fourier 
coefficients. The population class, haploid_lowd, holds instances of hypercube_lowd for the population, the mutant 
genotypes, the recombinant genotypes and the fitness function. 
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Figure 2 Calculating the Fourier transform in L2 operations. Arrow going up indicate addition, going down substraction. 
For the general L dimensional hypercubes L cycles are necessary where terms differing at different bits are combined. 



From a practical point of view, an instance of a low-dimensional population is initialized in three steps, 

(a) the class is instantiated 

haploid_lowd : : haploid_lowd ( int L=l, int rng_seed=0) 

where L is the number of loci and rng_seed is a seed for the random number generator (a random seed by default); 

(b) the initial population structure is set by the functions 

int haploid_lo wd :: s e t _al 1 e 1 e _ fr e que n c i e s ( double *freq, unsigned long N) 
int haploid_lowd :: set .genotypes ( vector <index_value _p air _t > gt ) 
int haploid_lowd :: set _wildtype ( unsigned long N) 

set the population size and composition. The first function initializes random genotypes in linkage equilibrium 
with the specifiec allele frequencies f req, the second explicitely sets a number of individuals for each genotype 
using the new type index_value_pair, and the last one sets a wildtype-only population of size N; 

(c) the fitness hypercube is initialized directly by accessing the attribute 

haploid_lowd : : fitness 

in the population class. 



b. Evolution 

In traditional Wright-Fisher type models, in each generation, the expected frequencies of gametes with a particular 
genotype after mutation, selection, and recombination are calculated and then the population resampled from this 
gamete distr ibution. We will now outline the steps required to update the population. A more detailed discussion can 
be found in (Neher and Shraiman 2011). All the following steps are called by either one of the following functions: 

• int haploid_lowd : : evolve ( int gen = l) 

updates the population looping over a specified number of generations gen; 

• int haploid_lowd : : evolve_norec ( int gen = l) 

is an alternative version that skips the resampling (deterministic evolution); 

• int haploid_lo wd :: evolve _det er minis t ic ( in t gen = l) 

is another alternative that skips the recombination step (asexual evolution). 
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c. Mutation 

Let P(g) be the genotype distribution at the beginning of a generation. Denoting the mutation rate towards the 1 
or state at locus i by u^°, the expected P{g) after mutation will be 

(L-l \ L-l 
1-E< )P{9) + J2 u t ip ^9) (B3) 
i=0 / i=Q 

where ITg denotes genotype g with locus i flipped from to 1 or vice versa. The first term is the loss due to mutation, 
while the second term is the gain due to mutation from neighbouring genotypes (in terms of Hamming distance) . 

The mutation rates can be specified by the folowing set of overloaded functions: either a single double rate (same 
for every position and in both forward and backward sense), two double rates (forward and backward rates), a L 
dimensional double array (site-specific, identical forward and backward rates), or a 2 x L dimensional double array, 

• int haploiddowd :: set _mut at ion_r at e ( double rate); 

takes a single rate and sets it for every position and in both forward and backward sense; 

• int haploid_lowd :: set _mut at ion_r at e ( double rate_forward , double rate-backward); 

takes two rates and sets rate_f orward as the forward rate (0 — > 1) and rate_backward as the backward rate 
(1^0) 

• int haploid_lowd :: set _mut at ion_r at e ( double * rates); 

takes the pointer to an array of length L and sets site-specific rates, the same for forward and backward 
mutations; 

• int haploid_lowd :: set _mut at ion_r at e ( double **rates) 

takes a pointer to a pair of (pointers to) arrays, each of length L, which contain the site-specific rates for forward 
(rates [0] ) and backward (rates [1] ) mutations. 

d. Selection 

Selection reweighs different the population of different genotypes according to their fitness as follows 

e F{g) 

P(g) <- ttv p (9) m 

{e*) 

where (e F ) is the population average of e Ftyg ^ , which is required to keep the population size constant. The corresponding 
function is 

int haploid_lowd : : select _gamet es () 



e. Resampling and genetic drift 

For deterministic modeling, one generation would be completed at this point and one would repeat the cycle, 
starting with mutation again. For stochastic population genetics, we still need to resample the population in a way 
that mimics the randomness of reproduction. The easiest and most generic way to do this is to resample a population 
of size N using a multinomial distribution with the current P(g) as sampling probabilities of different genotypes. 
Alternatively, one can sample individuals according to a Poisson distribution with mean NP(g) for each genotype, 
which will result in a population of approximately size N ±0(y/~N). For large populations, the two ways of resampling 
are equivalent and we chose the latter (much faster) alternative. The function 

int haploid_lowd :: resample ( ) 

samples the next generation the expected genotype frequencies. The expected population size used in the resampling 
is the carrying capacity. 
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f. Mating and recombination 

The computationally expensive part of the dynamics is recombination, which needs to consider all possible pairs of 
pairs of parents and all different ways in which their genetic information can be combined. In a facultatively sexual 
population, a fraction r of the individuals undergo mating and recombination. In obligate sexual populations, r = 1. 
The genotype distribution is updated according to the following rule: 

P{g) <— (l-r)P(g)+rR(g) (B5) 

The distribution R(g) of recombinant gametes would naively be computed as follows: 

%)-EE C K) P (9 ra )H9 P ), (B6) 

€ 9' 

where £ specifies the particular way the parental genomes are combined: £j = (resp. 1) if locus i is derived from 
the mother (resp. father). The genotype g' is summed over; it represents the part of the maternal (g m ) and paternal 
(g p ) genotypes that is not passed on to the offspring. We can decompose each parent into successful loci that made 
it into the offspring and wasted loci, as follows: g p = £ A g + £ A g' and g m = £ A g + £ A g' , where A and a bar over a 
variable indicate respectively the elementwise AND and NOT operators (i.e., £j := I — The function C assigns a 
probability to each inheritance pattern. Depending on whether the entire population undergoes sexual reproduction 
or only a fraction r of it, the entire population or a fraction r is replaced with R(g). The central ingredient for the 
efficient computation of R(g) is the Fourier decomposition introduced above. The generic Fourier coefficient of R(g) 
is given by 

rliU = 2 ~ L E *h •■■*«*( E E C(i)P(g m )P{9 p ) I (B7) 

9 \ t g' J 

Just as g p and g m can be expressed as a combination of g and g' ', we can invert the relation and express the generic U 
as a function of g p and g m , as follows: tj = £,t™ + £i if . Using this new basis and exchanging the order of summations, 
we obtain 

=2" L E C (£) E te 1 ^+67C)---te fc C+^C) P (3 m ) P (5 P )- ( Bg ) 

Notice that C(£) can be pulled out of the two inner sums, because the odds of inheriting a certain locus by the 
mother/father is independent of what their genetic makeup looks like. Next we expand the product and introduce 
new labels for compactness, 

^=2- L E C K)E P ™E E (B9) 

€ s m .s p «=0{ji},{/i 4 } 

where Z is the number of loci inherited from the mother among the k in (ii, . . . , Z runs from (everything happens 
to be contributed by the father) to k (everything from the mother), {ji} and {hi} are all (unordered) partitions of i 
into sets of size Z and k — Z, respectively. Now we can group all £j in the inner sum with C(£), all t™ with P(g m ) 1 and 
all if with P{g p ). The three sums (over £, g m , and g p ) are now completely decoupled. Moreover, the two sums over 
the parental genotypes happen to be the Fourier decomposition of P(g). Hence, we have 

k 

r il--ik - 2^ U ji...ji,h 1 -hk-l P h-jl P h 1 ...h k -i- ; 

'=0 {ji},{/n} 

The quantity 

can be calculated efficiently, for each pair of partitions ({ji}, {hi}), by realizing that (a) for k = L, there is exactly 
one term in the sum on the right that is non-zero and (b) all lower-order terms can be calculated by successive 
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marginalizations over unobserved loci. For instance, let us assume that k = L — 1 and that the only missing locus is 
the m-th one. We can compute 

jl—jlM—liL-i-i ji...jim,h 1 ...h L - 1 -i ' j 1 ...ji,h 1 ...h L - 1 -i m - \ I 

There are ffl ways of choosing k loci out of L, which can be inherited in 2 fc different ways (the partitions in j and h 
in Eq. (Bit )) such that the total number of coefficients is 3 L . Note that these coefficients are only calculated when 



the recombination rates change. Furthermore, this can be done for completely arbitrary recombination patterns, not 
necessarily only those with independent recombination events at different loci. 

haploicLlowd provides a function to calculate C(£) from recombination rates between loci assuming a circular or 
linear chromosome. The probability of a particular crossover pattern is calculated assuming independent crossovers. 
The function 

int haploid_lowd :: set _recombination_rate ( double *rec_rates) 

assumes a double array of length L — 1 for a linear chromosome and of length L for a circular chromosome. For a 
linear (resp. circular) chromosome, the i-th element of the array is the probability of recombining after (resp. before) 
the i-th locus. Furthermore, the mating probability r must be specified explicitely via the attribute 

haploid_lowd : : outcrossing_rate 

the default is obligate sexual reproduction. 

The code offers a simpler alternative for free recombination. In this case, only the global mating probability r needs 
to be entered. If the user does not set the recombination rates via set_recombination_rate, free recombination is 
the default behaviour. Otherwise, this option is controlled by the following boolean attribute 

haploid_lowd : : free_recombination 

Note that, in a circular chromosome, there is effectively one more inter-locus segment (between the last and the 
first locus) in which crossovers can occur, and the total number of crossovers has to be even. Assuming independent 
crossovers, the global recombination rate of circular chromosomes is lower than a linear chromosome of the same 
length by a factor of (1 — e~ r °), where rn is the recombination rate between the first and last loci. 

The recombination process itself is initiated by 

int haploid_lowd : : recombine ( ) 



2. FFPopSim for many loci 

For more than 20 loci, storing then entire genotype space and all possible recombinants becomes prohibitive. Hence 
we also include a streamlined individual based simulation package that can simulate arbitrarily large number of loci 
and has an overall runtime and memory requirements 0{NL) in the worst case scenario. The many-loci package uses 
the same interface as the few-loci one. This makes it easy, for example, to first test an evolutionary scenario using 
many (all) loci and to focus on the few crucial ones afterwards. 

To speed up the program in many cases of interest, identical genotypes are grouped into clones. This part of the 
library was developed for whole genome simulations of large HIV populations (L — 10 4 , N > 10 5 ). A specific wrapper 
class for HIV applications is provided. 



a. Specification of the population and the fitness func 

For more than 20 loci, it becomes infeasible to store the entire hypercube. Instead, we store individual genotypes as 
bitsets. Each genotype, together with the number of individuals that carry it, as well as traits and fitness associated 
with it is stored for as long as it is present in the population. All of this is aggregated in the structure clone _t. The 
population is a vector of clones. Each generation clones size are updated and added to the new generation, new clones 
are produced, and empty ones deleted. 

Fitness functions are again functions on the hypercube. The latter is implemented as hypercubeJiighd. Instead 
of storing all possible fitness values, hypercubeJiighd stores non-zero Fourier coefficients. Whenever a new genotype 
is produced, its fitness is calculated by summing the appropriate coefficients. 



b. Mutation 



To implement mutation, a Poisson distributed number n with mean Nfii is drawn for each locus i and n mutations 
are introduced at locus i into n randomly chosen genotypes. Mutations are bit-flip operations in the bitset. Only a 
global mutation rate is currently supported. 



c. Selection 

Prior to selection, the population average W := (e F ) and a growth rate adjustment exp(l — N/N ) are computed. 
The latter is used to keep the population size close to the carrying capacity No. The size of each clone is then updated 

with a Poisson distributed number with mean (1 — r)W exp(l — N/Nq), where r is the recombination rate. Another 

Poisson distributed number with mean rW exp(l — N/Nq) is set aside for recombination later. 



d. Recombination 

The individuals marked for sexual reproduction during the selection step are shuffled and paired. For each pair, a 
bitset representing the crossover pattern is produced and two new clones are produced from the two parental genomes. 
Alternatively, all loci of the two genomes can be reassorted at random. 



3. Python wrapper 

The C++ library includes Python bindings that greatly simplify interactive use and testing. The wrapping itself is 
done by SWIG (iBeazley 2003). Most notably, the C++ classes haploicLlowd, haploidJiighd and the HIV-specific 



subclass are fully exposed to Python, including all their public members. The performance speed for evolving a 
population is unchanged, since the evolve function iterates all steps internally for an arbitary number of generations. 

The bindings are not completely faithful to the CH — h interface, to ensure a more intuitive user experience. For 
instance, C++ attribute set/get members are translated into Python properties via the builtin property construct. 
Furthermore, since direct access to the hypercube_lowd instances from Python is not straightforward, a few utility 
functions have been written to do common tasks. The fitness hypercube can be set easily by either one of 

haploid_lowd . set_fitness_function (genotypes , fitnesses ) 
haploid_lowd . set_fitness_additive (fitness_main) 

The former function is used to set specific points on the hypercube: it takes a sequence of genotypes genotypes 
(integers or binary literals using Ob notation) and a sequence of fitness values fitnesses, corresponding to those 
genotypes. Any missing point on the fitness hypercube will be consider neutral. The second function creates an 
additive fitness landscape, in which main effects are specified by the L-dimensional input sequence f itness_main. 
After installation, the FFPopSim library can be used in Python as a module, e.g. 

from FFPopSim import haploid_lowd 



The bindings make heavy use of the NumPy library and its SWIG fragments and typemaps ( Oliphant 2006 ) . We 
therefore recommend to import NumPy before FFPopSim, although this is not strictly necessary. Moreover, the 
Python binsings include a few functions for plotting features of the population, such as genetic diversity. The Python 



module Matplotlib is required for this purpose (Hunter 2007 1 



The HIV-specific part of the code has been expanded further in Python to enable quick simulations of viral evolution 
under commonly studied conditions. In particular, random genotype-phenotype maps for viral replication capacity 
and drug resistance can be generated automatically from a few parameters, via the functions 

hivpopulation . set_replication_landscape 
hivpopulation . set_resistance_landscape 

The input parameters reflect a number of typical properties of HIV populations, such as the fraction of sites carrying 
potentially adaptive mutations. See the inline Python documentation for further details on these functions. Moreover, 
since studies of HIV evolution often involve a large number of genotypic data, a function for saving the genotype of 
random individuals from the current population in a compressed format has been added. The syntax is the following: 

hivpopulation . write_genotypes_compressed( filename , number_of_individuals) 

where filename is the name of the file, in which the data are to be stored, and number_of .individuals if the size of 
the random sample. The data can be accessed again by the standard Numpy load function. 
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Appendix C: Examples 

1. haploid lowd: Valley crossing 

One of the most striking effects of genetic epistasis is the slowdown of evolution when a combination of mutations is 
beneficial, but intermediate mutants are deleterious compared to wildtype. Such scenario is relevant in applications, 



for instance for the emergence of bacterial or viral resistance to drugs ( Weinreich et al. 2005 1 . Not surprisingly, 
recombination plays a central role in this process. On the one hand, it enhances the production rate of multiple 
mutants, on the other it depletes the class of complete mutants by back-recombination with deleterious backgrounds. 
FFPopSim makes the efficient simulation of such processes as easy as the following script: 

import FFPopSim 

L = 4 # number of loci 

N = lelO # population size 

sl=le— 5 # fitness of wildtype 

s2=0.01 # fitness of quadruple mutant 

c = FFPopSim . haploid_lowd (L) # create population 
c . set_genotypes ( [0 bO ] , [N] ) # start with wildtype 

c . se t _r ecombinat ion_r at es ( [ 1 e — 2] * (L — 1)) # set recombination rates 
c . set _mut at ion_r at e ( 1 e — 5) # set mutation rate 

# assign positive relative fitness to wildtype and quadruple mutant 
c. set_fitness_function ([ObO, Obllll], [si, sl+s2]) 

# cross valley with an accuracy of 100 generations 
gens_at_once = 100 

while c . get_genotype_frequency (Obllll ) <0. 5 and c . generation <le7 : 
c.evolve(gens_at_once) 

# print result 

print 'Time to cross the valley: ' + str ( c . generation)-!- ' generations' 

If the script is run with different recombination rates, the effect of this parameter on the time for valley crossing 
can be investigated, as shown in Fig. [3] The full scripts producing the figures is provides as separate file. 
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Figure 3 Example of the Python wrapper for haploicLlowd. The average time for crossing a 4 dimensional fitness valley is 
plotted against the recombination rate. Bars indicate the standard deviation across fifty repetitions. Crossing a valley becomes 
essentially impossible once the total recombination rate exceeds the fitness benefit, at least when mutation rates are small. 
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2. haploid lowd: HIV immune escape 



During an HIV infection, the host immune system targets several viral epitopes simultaneously via a diverse arsenal 
of cytotoxic T-cells (CTLs). Mutations at several loci are thus selected for and start to rise in frequency at the same 
time but, because of the limited amount of recombination, end up in wasteful competition (interference) at frequencies 
of order one. 

The theoretical description of genetic interference is involved and often limited to two- loci models, but FFPopSim 
makes the simulation of this process straightforward. The following script evolves a 4-loci population under parallel 
positive selection and tracks its genotype frequencies: 

import FFPopSim 

L = 4 # number of loci 

N = lelO # population size 

c = FFPopSim . haploid_lowd (L) # create population 
c . set_genotypes ( [0 bO ] , [N] ) # start with wildtype 

c . se t _r ecombinat ion_r at es ( [ 1 e — 4] * (L — 1)) # set recombination rates 
c . set _mut at ion_r at e ( 1 e — 5) # set mutation rate 

pop . s e t _f i t n e s s _ad d i t i v e ( [ . 3 , 0.2, 0.1, 0.05]) # set an additive fitness landscape 

# evolve until fixation of the quadruple mutant, 

# storing times and genotype frequencies 
times = [ 

genotype_frequencies = [ 

while pop . get_genotype_frequency (Obllll ) < 0.99 and pop . generation <le7 : 
pop . evolve ( ) 

times .append (pop. generation) 

genotype .frequencies . append ( pop . get_genotype_frequencies ()) 

The resulting genotype frequencies are shown in Fig. |4j The full scripts producing the figures is provides as separate 
file. 
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Figure 4 Example of the Python wrapper for haploicLlowd. The frequencies of some genotypes in the population during 
strong, parallel positive selection are shown. 
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3. haploidJhighd: HIV chronic infection 

HIV evolution during chronic infection is determined by a number of parallel processes, such as mutation, recom- 
bination, and selection imposed by the immune system. In combination, these processes give rise to a complicated 
dynamics and don't understand how population features such as population diversity depend on model parameters. 
Hence simulations are an important source of insight. 

FFPopSim offers a specific double C++/Python interface to this problem via its class hivpopulation. The following 
script simulates an HIV population for one thousand generations, under a random static fitness landscape, and stores 
a hundred random genomes from the final population in a compressed file: 

import FFPopSim 

N = 1000 # population size 

# create the population . Default options : 

# — wildtype only 

# — no treatment, i.e. replication is the same as fitness 
pop = FFPopSim . hivpopulation (N) 

# set a random, additive replication capacity landscape using 

# the following parameters 

pop . set_replication_landscape (adaptive- fraction=0. 01, 

effect _size .adaptive = 0.03) 

# evolve the HIV population 
pop . evolve (1000) 

# store final genomes in compressed format for further analysis 
pop . write_genotypes_compressed( ' HIV -genomes . npz ' , 100) 

NumPy can be used subsequently to analyze the genome sequences. Alternatively, the internal Python functions can 
be used, e.g. for calculating the fitness distribution directly using hivpopulation. get_f itnessJiistogram, as shown 
in Fig. [5] The full scripts producing the figures is provides as separate file. 
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Figure 5 Example of the Python wrapper for hivpopulation. The fitness distribution of the population after 250, 500, 750, 
and 1000 generations from transmission is shown. 



